{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在上一章节，我们给大家简单介绍了赛题的内容和几种解决方案。从本章开始我们将会逐渐带着大家使用思路1到思路4来完成本次赛题。在讲解工具使用的同时，我们还会讲解一些算法的原理和相关知识点，并会给出一定的参考文献供大家深入学习。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **Task2 数据读取与数据分析**\n",
    "\n",
    "本章主要内容为数据读取和数据分析，具体使用`Pandas`库完成数据读取操作，并对赛题数据进行分析构成。\n",
    "\n",
    "### **学习目标**\n",
    "\n",
    "- 学习使用`Pandas`读取赛题数据\n",
    "- 分析赛题数据的分布规律\n",
    "\n",
    "### **数据读取**\n",
    "\n",
    "赛题数据虽然是文本数据，每个新闻是不定长的，但任然使用csv格式进行存储。因此可以直接用`Pandas`完成数据读取的操作。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "train_df = pd.read_csv('../data/train_set.csv', sep='\\t', nrows=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里的`read_csv`由三部分构成：\n",
    "\n",
    "- 读取的文件路径，这里需要根据改成你本地的路径，可以使用相对路径或绝对路径；\n",
    "\n",
    "- 分隔符`sep`，为每列分割的字符，设置为`\\t`即可；\n",
    "- 读取行数`nrows`，为此次读取文件的函数，是数值类型（由于数据集比较大，建议先设置为100）；"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>2967 6758 339 2021 1854 3731 4109 3792 4149 15...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>11</td>\n",
       "      <td>4464 486 6352 5619 2465 4802 1452 3137 5778 54...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>7346 4068 5074 3747 5681 6093 1777 2226 7354 6...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>7159 948 4866 2109 5520 2490 211 3956 5520 549...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3</td>\n",
       "      <td>3646 3055 3055 2490 4659 6065 3370 5814 2465 5...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   label                                               text\n",
       "0      2  2967 6758 339 2021 1854 3731 4109 3792 4149 15...\n",
       "1     11  4464 486 6352 5619 2465 4802 1452 3137 5778 54...\n",
       "2      3  7346 4068 5074 3747 5681 6093 1777 2226 7354 6...\n",
       "3      2  7159 948 4866 2109 5520 2490 211 3956 5520 549...\n",
       "4      3  3646 3055 3055 2490 4659 6065 3370 5814 2465 5..."
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上图是读取好的数据，是表格的形式。第一列为新闻的类别，第二列为新闻的字符。\n",
    "\n",
    "### **数据分析**\n",
    "\n",
    "在读取完成数据集后，我们还可以对数据集进行数据分析的操作。虽然对于非结构数据并不需要做很多的数据分析，但通过数据分析还是可以找出一些规律的。\n",
    "\n",
    "\n",
    "\n",
    "此步骤我们读取了所有的训练集数据，在此我们通过数据分析希望得出以下结论：\n",
    "\n",
    "- 赛题数据中，新闻文本的长度是多少？\n",
    "- 赛题数据的类别分布是怎么样的，哪些类别比较多？\n",
    "- 赛题数据中，字符分布是怎么样的？\n",
    "\n",
    "\n",
    "\n",
    "#### **句子长度分析**\n",
    "\n",
    "在赛题数据中每行句子的字符使用空格进行隔开，所以可以直接统计单词的个数来得到每个句子的长度。统计并如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Populating the interactive namespace from numpy and matplotlib\n",
      "count     100.000000\n",
      "mean      872.320000\n",
      "std       923.138191\n",
      "min        64.000000\n",
      "25%       359.500000\n",
      "50%       598.000000\n",
      "75%      1058.000000\n",
      "max      7125.000000\n",
      "Name: text_len, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "%pylab inline\n",
    "train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))\n",
    "print(train_df['text_len'].describe())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对新闻句子的统计可以得出，本次赛题给定的文本比较长，每个句子平均由907个字符构成，最短的句子长度为2，最长的句子长度为57921。\n",
    "\n",
    "下图将句子长度绘制了直方图，可见大部分句子的长度都几种在2000以内。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Text(0.5, 1.0, 'Histogram of char count')"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAEWCAYAAABPON1ZAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAXSUlEQVR4nO3deZSldX3n8feHbpA1IFA6KGBhQOIWhfQgKnEiLoOKRjNkhLgvp09idCRjxmk0JpqcmSFRiU4mLj0u44jiQtQ44IKKxsEgphtoZR0WWwEFWgybGhD9zh/Pr+BSVHfdgrpVP+j365x76tmf73Pr1qee+3u2VBWSpH5ts9wFSJK2zKCWpM4Z1JLUOYNakjpnUEtS5wxqSeqcQb0VSnJ+kt9a7jqWU5LnJbkiyc1JDlrAfG9OcuIka5NmM6jvY5JsTPLUWcNemuSMmf6qemRVfW2e5UwnqSQrJ1Tqcnsb8Oqq2rmqzlnuYpaT/3z6Z1BrWXTwD+AhwPnLWUAH74HuJQzqrdDoXneSQ5KsS3JjkmuSnNAm+3r7eX1rHnh8km2S/EmS7yW5Nsn/TrLryHJf3MZdl+RNs9bz5iQnJzkxyY3AS9u6z0xyfZIfJvkfSbYbWV4leVWSS5LclOQvkvxqkn9s9X5idPpZ2zhnrUnul+RmYAWwIcllm5n/kUm+lOTH7X15w8jo7drybmrNSKtG5luT5LI27oIkzxsZ99Ik30jy10muA948x3pXJHnDyDLWJ9mnjXtCkn9KckP7+YS5fqcj7/eJrXvm29FLknw/yY+SvLGNOwJ4A/D89nveMNf7oWVWVb7uQy9gI/DUWcNeCpwx1zTAmcCLWvfOwKGtexooYOXIfC8HLgUe2qb9FPDhNu4RwM3AYcB2DE0LPx9Zz5tb/3MZdhB2AH4DOBRY2dZ3IXDsyPoK+HvgV4BHArcAX2nr3xW4AHjJZt6HzdY6suz9NzPvLsAPgdcB27f+x41sx78Az2QI+/8GfHNk3t8FHtS28fnAT4C9Rn4PtwGvadu8wxzr/k/Ad4ADgQCPAfYAdgf+GXhRm/eY1r/HXL/3VueJs36X/7O9749p7+XDZ0/rq8+Xe9T3TZ9pe6nXJ7keeNcWpv05sH+SPavq5qr65hamfQFwQlVdXlU3A8cBR7ev8EcB/6eqzqiqW4E/ZQiHUWdW1Weq6pdV9bOqWl9V36yq26pqI/Be4N/MmuevqurGqjofOA84ra3/BuDzwOYOBG6p1vkcCVxdVW+vqn+pqpuq6qyR8WdU1eeq6hfAhxmCD4Cq+mRV/aBt48eBS4BDRub9QVX9Tdvmn82x7lcCf1JVF9dgQ1VdBzwLuKSqPtzmPQm4CHj2GNsz4y3tfd8AbBitW30zqO+bnltVu828gFdtYdpXAA8DLmpfp4/cwrQPAr430v89hr27B7ZxV8yMqKqfAtfNmv+K0Z4kD0tySpKrW3PIfwX2nDXPNSPdP5ujf+e7Uet89gHmbBJprh7p/imw/cw/gNb8c+7IP8lHcedtutN7sIB1z94eWv+D51nelure3HunzhjUW7mquqSqjgEeAPwlcHKSnbjr3jDADxgOws3Yl+Gr/DUMTQV7z4xIsgPDV/Y7rW5W/7sZ9goPqKpfYWgrzd3fmrFrnc8VDE0mC5LkIQzNC69maJLYjeFbwOg2zXe7yiuAX51j+OztgWGbrmrdPwF2HBn3r8Yse5yatMwM6q1ckhcmmaqqXwLXt8G/BDa1n6OBdRLwR0n2S7Izwx7wx6vqNuBk4NntgNd2DO2e84XuLsCNwM1Jfg34g8Xarnlqnc8pwF5Jjm0HH3dJ8rgx5pv5B7cJIMnLGPaoF+J9wF8kOSCDX0+yB/A54GFJfi/JyiTPZzgucEqb71yGpp1t28HNoxawzmuA6STmQaf8xegI4Px2JsQ7gaNbO+ZPgf8CfKN9jT8U+ABDm+zXge8yHFR7DUBrQ34N8DGGveubgWsZDlptzh8DvwfcxLAn+vFF3K7N1jqfqroJeBpD++/VDO3MTx5jvguAtzMcoL0GeDTwjQXWfQLwCeA0hn9i72c46HgdQ9v56xialF4PHFlVP2rzvYlhT/yfgbcAH13AOj/Zfl6X5OwF1qslkCq/9Wjxtb3Y6xmaNb673PVI92buUWvRJHl2kh1bG/fbGE4z27i8VUn3fga1FtNvMxz0+gFwAEMzil/ZpHvIpg9J6px71JLUuYncFGbPPfes6enpSSxaku6T1q9f/6Oqmppr3ESCenp6mnXr1k1i0ZJ0n5Rk9pWnt7PpQ5I6Z1BLUucMaknqnEEtSZ0zqCWpcwa1JHVurKBO8kft2XDnJTkpyfaTLkySNJg3qJM8GPgPwKqqehTDc+KOnnRhkqTBuE0fK4Ed2uOGdmS46Y4kaQnMG9RVdRXDLSu/z3BD+Buq6rTZ0yVZnWRdknWbNm1a/EoXaHrNqUyvOXW5y5Cke2ycpo/7M9y+cj+GB2zulOSFs6erqrVVtaqqVk1NzXm5uiTpbhin6eOpwHeralNV/Rz4FPCEyZYlSZoxTlB/Hzi0PbkjwFOACydbliRpxjht1GcxPGH6bIZHK20DrJ1wXZKkZqzbnFbVnwF/NuFaJElz8MpESeqcQS1JnTOoJalzBrUkdc6glqTOGdSS1DmDWpI6Z1BLUucMaknqnEEtSZ0zqCWpcwa1JHXOoJakzhnUktQ5g1qSOmdQS1Lnxnm47YFJzh153Zjk2KUoTpI0xhNequpi4LEASVYAVwGfnnBdkqRmoU0fTwEuq6rvTaIYSdJdLTSojwZOmkQhkqS5jR3USbYDngN8cjPjVydZl2Tdpk2bFqs+SdrqLWSP+hnA2VV1zVwjq2ptVa2qqlVTU1OLU50kaUFBfQw2e0jSkhsrqJPsBDwN+NRky5EkzTbv6XkAVfUTYI8J1yJJmoNXJkpS5wxqSeqcQS1JnTOoJalzBrUkdc6glqTOGdSS1DmDWpI6Z1BLUucMaknqnEEtSZ0zqCWpcwa1JHXOoJakzhnUktQ5g1qSOmdQS1Lnxn0U125JTk5yUZILkzx+0oVJkgZjPYoLeCfwhao6Ksl2wI4TrEmSNGLeoE6yK/Ak4KUAVXUrcOtky5IkzRin6WM/YBPwwSTnJHlfeyr5nSRZnWRdknWbNm1a9EJHTa85daLLl6SejBPUK4GDgXdX1UHAT4A1syeqqrVVtaqqVk1NTS1ymZK09RonqK8Erqyqs1r/yQzBLUlaAvMGdVVdDVyR5MA26CnABROtSpJ0u3HP+ngN8JF2xsflwMsmV5IkadRYQV1V5wKrJlyLJGkOXpkoSZ0zqCWpcwa1JHXOoJakzhnUktQ5g1qSOmdQS1LnDGpJ6pxBLUmdM6glqXMGtSR1zqCWpM4Z1JLUOYNakjpnUEtS5wxqSeqcQS1JnRvrCS9JNgI3Ab8Abqsqn/YiSUtk3GcmAjy5qn40sUokSXOy6UOSOjduUBdwWpL1SVbPNUGS1UnWJVm3adOmRSlues2pC55+ofNIUu/GDerDqupg4BnAHyZ50uwJqmptVa2qqlVTU1OLWqQkbc3GCuqquqr9vBb4NHDIJIuSJN1h3qBOslOSXWa6gacD5026MEnSYJyzPh4IfDrJzPQfraovTLQqSdLt5g3qqroceMwS1CJJmoOn50lS5wxqSeqcQS1JnTOoJalzBrUkdc6glqTOGdSS1DmDWpI6Z1BLUucMaknqnEEtSZ0zqCWpcwa1JHXOoJakzhnUktQ5g1qSOmdQS1Lnxg7qJCuSnJPklEkWJEm6s4XsUb8WuHBShUiS5jZWUCfZG3gW8L7JliNJmm3cPep3AK8Hfrm5CZKsTrIuybpNmzYtSnHjml5zKtNrTp3IchdjGkm6J+YN6iRHAtdW1fotTVdVa6tqVVWtmpqaWrQCJWlrN84e9ROB5yTZCHwMODzJiROtSpJ0u3mDuqqOq6q9q2oaOBo4vapeOPHKJEmA51FLUvdWLmTiqvoa8LWJVCJJmpN71JLUOYNakjpnUEtS5wxqSeqcQS1JnTOoJalzBrUkdc6glqTOGdSS1DmDWpI6Z1BLUucMaknqnEEtSZ0zqCWpcwa1JHXOoJakzhnUktS5cZ5Cvn2SbyXZkOT8JG9ZisIkSYNxHsV1C3B4Vd2cZFvgjCSfr6pvTrg2SRJjBHVVFXBz6922vWqSRUmS7jDWw22TrADWA/sDf1tVZ80xzWpgNcC+++67mDVu0fSaUxc0zcbjn3WX4RuPf9ac3ZtbzpamkaTFNtbBxKr6RVU9FtgbOCTJo+aYZm1VraqqVVNTU4tdpyRttRZ01kdVXQ98FThiMuVIkmYb56yPqSS7te4dgKcBF026MEnSYJw26r2AD7V26m2AT1TVKZMtS5I0Y5yzPr4NHLQEtUiS5uCViZLUOYNakjpnUEtS5wxqSeqcQS1JnTOoJalzBrUkdc6glqTOGdSS1DmDWpI6Z1BLUucMaknqnEEtSZ0zqCWpcwa1JHXOoJakzhnUktS5cZ6ZuE+Srya5IMn5SV67FIVJkgbjPDPxNuB1VXV2kl2A9Um+VFUXTLg2SRJj7FFX1Q+r6uzWfRNwIfDgSRcmSRqMs0d9uyTTDA+6PWuOcauB1QD77rvv3S5oes2pW+yfb/h84xbDlmraePyz7vGy7+kyJN23jH0wMcnOwN8Bx1bVjbPHV9XaqlpVVaumpqYWs0ZJ2qqNFdRJtmUI6Y9U1acmW5IkadQ4Z30EeD9wYVWdMPmSJEmjxtmjfiLwIuDwJOe21zMnXJckqZn3YGJVnQFkCWqRJM3BKxMlqXMGtSR1zqCWpM4Z1JLUOYNakjpnUEtS5wxqSeqcQS1JnTOoJalzBrUkdc6glqTOGdSS1DmDWpI6Z1BLUucMaknqnEEtSZ0zqCWpc+M8M/EDSa5Nct5SFCRJurNx9qj/F3DEhOuQJG3GvEFdVV8HfrwEtUiS5rBobdRJVidZl2Tdpk2bFmux99j0mlPv0j/z2tw04yzn7qx/S+ucXdNctc41zVzrG3fZi2WcuuaaZzGmmbS7s23aOk3yc7JoQV1Va6tqVVWtmpqaWqzFStJWz7M+JKlzBrUkdW6c0/NOAs4EDkxyZZJXTL4sSdKMlfNNUFXHLEUhkqS52fQhSZ0zqCWpcwa1JHXOoJakzhnUktQ5g1qSOmdQS1LnDGpJ6pxBLUmdM6glqXMGtSR1zqCWpM4Z1JLUOYNakjpnUEtS5wxqSeqcQS1JnRsrqJMckeTiJJcmWTPpoiRJdxjnmYkrgL8FngE8AjgmySMmXZgkaTDOHvUhwKVVdXlV3Qp8DPjtyZYlSZqRqtryBMlRwBFV9crW/yLgcVX16lnTrQZWt94DgYsXUMeewI8WMP1ystbJsNbJuTfVuzXX+pCqmpprxLxPIR9XVa0F1t6deZOsq6pVi1XLJFnrZFjr5Nyb6rXWuY3T9HEVsM9I/95tmCRpCYwT1P8EHJBkvyTbAUcDn51sWZKkGfM2fVTVbUleDXwRWAF8oKrOX+Q67laTyTKx1smw1sm5N9VrrXOY92CiJGl5eWWiJHXOoJakzi17UPdweXqSDyS5Nsl5I8N2T/KlJJe0n/dvw5Pkv7d6v53k4JF5XtKmvyTJSyZQ5z5JvprkgiTnJ3ltr7W2dWyf5FtJNrR639KG75fkrFbXx9tBapLcr/Vf2sZPjyzruDb84iT/dkL1rkhyTpJTeq6zrWdjku8kOTfJujas18/BbklOTnJRkguTPL7HWpMc2N7PmdeNSY7totaqWrYXw8HJy4CHAtsBG4BHLEMdTwIOBs4bGfZXwJrWvQb4y9b9TODzQIBDgbPa8N2By9vP+7fu+y9ynXsBB7fuXYD/x3BZf3e1tvUE2Ll1bwuc1er4BHB0G/4e4A9a96uA97Tuo4GPt+5HtM/G/YD92mdmxQTq/Y/AR4FTWn+XdbZ1bQT2nDWs18/Bh4BXtu7tgN16rXWk5hXA1cBDeqh1Ihu5gDfj8cAXR/qPA45bplqmuXNQXwzs1br3Ai5u3e8Fjpk9HXAM8N6R4XeabkI1/z3wtHtJrTsCZwOPY7iaa+XszwDDmUWPb90r23SZ/bkYnW4R69sb+ApwOHBKW293dY4seyN3DeruPgfArsB3aScu9FzrrPqeDnyjl1qXu+njwcAVI/1XtmE9eGBV/bB1Xw08sHVvruYl3Zb2dfsghr3UbmttzQnnAtcCX2LYy7y+qm6bY92319XG3wDssUT1vgN4PfDL1r9Hp3XOKOC0JOsz3L4B+vwc7AdsAj7YmpXel2SnTmsddTRwUute9lqXO6jvFWr4t9jNeYxJdgb+Dji2qm4cHddbrVX1i6p6LMMe6yHAry1zSXeR5Ejg2qpav9y1LMBhVXUww10t/zDJk0ZHdvQ5WMnQrPjuqjoI+AlD88HtOqoVgHYs4jnAJ2ePW65alzuoe748/ZokewG0n9e24ZureUm2Jcm2DCH9kar6VM+1jqqq64GvMjQh7JZk5mKr0XXfXlcbvytw3RLU+0TgOUk2Mtwd8nDgnR3Webuquqr9vBb4NMM/wR4/B1cCV1bVWa3/ZIbg7rHWGc8Azq6qa1r/ste63EHd8+XpnwVmjta+hKE9eGb4i9sR30OBG9rXoi8CT09y/3ZU+Olt2KJJEuD9wIVVdULPtbZ6p5Ls1rp3YGhPv5AhsI/aTL0z23EUcHrbg/kscHQ722I/4ADgW4tVZ1UdV1V7V9U0w2fw9Kp6QW91zkiyU5JdZroZfn/n0eHnoKquBq5IcmAb9BTggh5rHXEMdzR7zNS0vLVOqjF+AY32z2Q4e+Ey4I3LVMNJwA+BnzPsAbyCoc3xK8AlwJeB3du0YXiQwmXAd4BVI8t5OXBpe71sAnUexvC169vAue31zB5rbev4deCcVu95wJ+24Q9lCLBLGb5e3q8N3771X9rGP3RkWW9s23Ex8IwJfhZ+izvO+uiyzlbXhvY6f+bvpuPPwWOBde1z8BmGMyF6rXUnhm9Hu44MW/ZavYRckjq33E0fkqR5GNSS1DmDWpI6Z1BLUucMaknqnEGtRZdkj5E7kF2d5KqR/u0WsJzdk/z+Atd9YpLnLrzq5ZHkd5J0d7Wm+rJoTyGXZlTVdQznzpLkzcDNVfW2u7Go3YHfZ7hz3cS1C4pSVb+cd+LF8zsM9xe5aAnXqXsZ96i1pNp9er/V9q7flWSbdmXqJW0PekWSf0xyOHA8MHOP4OPnWNbL2n2ANyT54MioJ7dlXJ7keW3aX0lyepKz2zxHtuH7Z7i/90cYLh7Za9Y6HpfkzLaOs5LsmGSHJB/KcD/os2fus5HklUneMTLvF5IclmRlkuuTHN+Wc2aSByT5TYYLlv66beP0or7Zus9wj1pLJsmjgOcBT6jhoclrGe73/NEkbwfexXC13TlVdXqS7wP713BTp9nLegzwn9uyfpxk95HRD2C4f8ejGe4p/WngZ8Bzq+rGJA8AvsFwO1MYbhT14qpaN2sd2zPc++PfVdXZSXYFbgH+GLilqh6d5JHA55IcMM/m7wr8Q1WtSXIC8PKqOj7J54CTq+oz47yH2joZ1FpKTwX+NbBuaGVgB+64Xeh7kvwu8DKG27fO53CGG/b/uM3/45Fxn6nhkttvJ5m5vWSA45McxtDUsE+SPdu4y2aHdPNw4PtVdXZbxw0AbRlvbcPOT/IDYP956v1ZVX2+da8HfnOMbZQAg1pLK8AHqupNdxkx3Lr1QQxP1tiZ4XaYd9cts9YJ8GKGvdqD2978lQz37OAermvUbdy5OXH7ke5bR7p/gX97WgDbqLWUvgz8+5k92XZ2yL5t3FuBDwJ/zvBEDICbGB45NpfTgefPNHnMavqYy64M95y+LcnTGO9G7hcA+6Y9C6+1c68A/i/wgjbs4Qzt2pcyPHXloHY3tWngN8ZYx5a2UQIMai2hqvoO8Bbgy0m+DZwGPDDJU4DHAG+vqg8B2yR5UQ33A17fDtodP2tZGxieZff1DE+Qees8q/8w8IQk32G4leklY9R7C8MtL9+dZEOr937A3wA7tGV9hKF9+1bgHxjuO3wh8HaGuxvO5yTgDR5M1JZ49zxJ6px71JLUOYNakjpnUEtS5wxqSeqcQS1JnTOoJalzBrUkde7/A6D11rH0AZyuAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "_ = plt.hist(train_df['text_len'], bins=200)\n",
    "plt.xlabel('Text char count')\n",
    "plt.title(\"Histogram of char count\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **新闻类别分布**\n",
    "\n",
    "接下来可以对数据集的类别进行分布统计，具体统计每类新闻的样本个数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Text(0.5, 0, 'category')"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEZCAYAAACZwO5kAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAZR0lEQVR4nO3de5xdZX3v8c+Xixy538ZwSwhHEEEtlM4JCligAg2QilpUwCooGEWpcmpPDwetID1t0Z7COYUqrwiIolyr2FSuUSSAVSCBhAQBuRhMAoRwDbdWA9/zx1rjazvsTWZmrZlkHr7v12u/Zq1nPfu3nuzMfGftZ69ZS7aJiIhyrbW6BxAREaMrQR8RUbgEfURE4RL0ERGFS9BHRBQuQR8RUbgEfUQHSZa04+oeR0SbEvQx6iQtkvSYpA062o6TdMNqHNZrkqRTJX17dY8jxlaCPsbK2sBnV/cgIl6LEvQxVv4B+EtJm3bbKOnNkmZJelLSvZI+ULfvIOlpSWvV61+X9FjH8y6UdGK9fIykByU9K+mXkj7UY19rSzpZ0gN137mSJnbpd6ikOyStkLRY0qkd2/6LpG9LeqIe322SJrQ1Dkl71TWfqb/u1fG8RZIO6Fj/7VG6pMn19NPRkn4l6XFJn6+3TQVOBj4o6TlJ87uNK8qToI+xMge4AfjLwRvqKZ1ZwEXAG4AjgK9K2tX2L4EVwO/X3f8QeE7SLvX6vsDsusY/AQfb3gjYC5jXYyx/ARwJHAJsDHwMeKFLv+eBjwCbAocCx0t6T73taGATYCKwBfBJ4MU2xiFpc+DKus4WwBnAlZK26FGnm32AnYF3AV+UtIvta4C/Ay61vaHt3YZRL8axBH2MpS8Cfy6pb1D7NGCR7W/YXmn7DuC7wPvr7bOBfSVtVa//S72+A1VADhyZvgy8VdLrbT9i+64e4zgO+ILte12Zb/uJwZ1s32B7ge2Xbd8JXEz1iwXgN1QhvKPtl2zPtb2ipXEcCtxn+8L69bgYuAf4kx51uvmS7Rdtz69fn4T6a1iCPsaM7YXAD4CTBm3aHtizngJ5WtLTwIeAgWCfDexHdTR/I9U7g33rx011ED8PfJDqyPoRSVdKenOPoUwEHljVeCXtKenHkpZLeqauvWW9+ULgWuASSQ9L+oqkdVsaxzbAQ4PaHgK2XdWYOzzasfwCsOEwnhuFSdDHWDsF+Di/G1qLgdm2N+14bGj7+Hr7bOCdVGE/G7gZ2Jt62magiO1rbR8IbE11BPz1HmNYDLxxCGO9CJgJTLS9CXAOoHpfv7H9Jdu7Uk3PTKOa5mljHA9T/fLrNAlYWi8/D6zfsW0rhi6Xq30NStDHmLJ9P3Ap8JmO5h8Ab5L0YUnr1o//NjAPb/s+4EXgz6h+IawAlgF/Sh30kiZIOqyeI/9P4DmqKZRuzgX+RtJOqvxej/nvjYAnbf+HpCnAUQMbJO0v6W2S1qb6DOE3wMstjeOq+vU4StI6kj4I7Fq/TlDN+R9Rv079wOE96nezDJg88OF2vDbkPztWh9OA355Tb/tZ4CCqD2Efppp2+DKwXsdzZgNP2F7csS7g9np9LaoPNx8GnqQ62j+e7s4ALgOuowrp84DXd+n3KeA0Sc9Sfb5wWce2rag+K1gB3F2P58I2xlHP008DPgc8AfwVMM324/Xz/prqncBTwJeo3nkM1eX11yck3f6qPaMYyo1HIiLKliP6iIjCJegjIgqXoI+IKFyCPiKicAn6iIjCrbO6B9DNlltu6cmTJ6/uYUREjBtz58593Pbgy4sAa2jQT548mTlz5qzuYUREjBuSBl8247cydRMRUbgEfURE4RL0ERGFS9BHRBQuQR8RUbgEfURE4RL0ERGFS9BHRBRujfyDqV4mn3TlsPovOv3QURpJRMT4kSP6iIjCJegjIgqXoI+IKFyCPiKicAn6iIjCJegjIgqXoI+IKFyCPiKicAn6iIjCJegjIgo3ri6BMJpyeYWIKFWO6CMiCrfKI3pJ5wPTgMdsv7VuuxTYue6yKfC07d27PHcR8CzwErDSdn9L446IiCEaytTNBcDZwLcGGmx/cGBZ0j8Cz7zK8/e3/fhIBxgREc2sMuht3yhpcrdtkgR8APijdocVERFtaTpH/05gme37emw3cJ2kuZKmv1ohSdMlzZE0Z/ny5Q2HFRERA5oG/ZHAxa+yfR/bewAHA5+W9Ie9OtqeYbvfdn9fX1/DYUVExIARB72kdYD3AZf26mN7af31MeAKYMpI9xcRESPT5Ij+AOAe20u6bZS0gaSNBpaBg4CFDfYXEREjsMqgl3Qx8FNgZ0lLJB1bbzqCQdM2kraRdFW9OgG4WdJ84FbgStvXtDf0iIgYiqGcdXNkj/ZjurQ9DBxSLz8I7NZwfBER0VD+MjYionAJ+oiIwiXoIyIKl6CPiChcgj4ionAJ+oiIwiXoIyIKl6CPiChcgj4ionAJ+oiIwiXoIyIKl6CPiChcgj4ionAJ+oiIwiXoIyIKl6CPiChcgj4ionAJ+oiIwg3lnrHnS3pM0sKOtlMlLZU0r34c0uO5UyXdK+l+SSe1OfCIiBiaoRzRXwBM7dJ+pu3d68dVgzdKWhv4Z+BgYFfgSEm7NhlsREQM3yqD3vaNwJMjqD0FuN/2g7Z/DVwCHDaCOhER0UCTOfoTJN1ZT+1s1mX7tsDijvUldVtERIyhkQb914A3ArsDjwD/2HQgkqZLmiNpzvLly5uWi4iI2oiC3vYy2y/Zfhn4OtU0zWBLgYkd69vVbb1qzrDdb7u/r69vJMOKiIguRhT0krbuWH0vsLBLt9uAnSTtIOl1wBHAzJHsLyIiRm6dVXWQdDGwH7ClpCXAKcB+knYHDCwCPlH33QY41/YhtldKOgG4FlgbON/2XaPyr4iIiJ5WGfS2j+zSfF6Pvg8Dh3SsXwW84tTLiIgYO/nL2IiIwiXoIyIKl6CPiChcgj4ionAJ+oiIwiXoIyIKl6CPiChcgj4ionAJ+oiIwiXoIyIKl6CPiChcgj4ionAJ+oiIwiXoIyIKl6CPiChcgj4ionCrvPFItGPySVcOq/+i0w8dpZFExGtNjugjIgq3yqCXdL6kxyQt7Gj7B0n3SLpT0hWSNu3x3EWSFkiaJ2lOmwOPiIihGcoR/QXA1EFts4C32v494BfA/3qV5+9ve3fb/SMbYkRENLHKoLd9I/DkoLbrbK+sV38GbDcKY4uIiBa0MUf/MeDqHtsMXCdprqTpLewrIiKGqdFZN5I+D6wEvtOjyz62l0p6AzBL0j31O4RutaYD0wEmTZrUZFgREdFhxEf0ko4BpgEfsu1ufWwvrb8+BlwBTOlVz/YM2/22+/v6+kY6rIiIGGREQS9pKvBXwLttv9CjzwaSNhpYBg4CFnbrGxERo2cop1deDPwU2FnSEknHAmcDG1FNx8yTdE7ddxtJV9VPnQDcLGk+cCtwpe1rRuVfERERPa1yjt72kV2az+vR92HgkHr5QWC3RqOLiIjGcgmEQuQSCxHRSy6BEBFRuAR9REThEvQREYVL0EdEFC5BHxFRuAR9REThEvQREYVL0EdEFC5BHxFRuAR9REThEvQREYVL0EdEFC5BHxFRuAR9REThEvQREYVL0EdEFC5BHxFRuAR9REThhhT0ks6X9JikhR1tm0uaJem++utmPZ57dN3nPklHtzXwiIgYmqEe0V8ATB3UdhLwI9s7AT+q13+HpM2BU4A9gSnAKb1+IURExOgYUtDbvhF4clDzYcA36+VvAu/p8tQ/BmbZftL2U8AsXvkLIyIiRlGTOfoJth+plx8FJnTpsy2wuGN9Sd32CpKmS5ojac7y5csbDCsiIjq18mGsbQNuWGOG7X7b/X19fW0MKyIiaBb0yyRtDVB/faxLn6XAxI717eq2iIgYI02CfiYwcBbN0cC/dulzLXCQpM3qD2EPqtsiImKMDPX0youBnwI7S1oi6VjgdOBASfcBB9TrSOqXdC6A7SeBvwFuqx+n1W0RETFG1hlKJ9tH9tj0ri595wDHdayfD5w/otFFRERj+cvYiIjCJegjIgqXoI+IKFyCPiKicAn6iIjCJegjIgqXoI+IKFyCPiKicAn6iIjCJegjIgqXoI+IKFyCPiKicAn6iIjCJegjIgqXoI+IKFyCPiKicEO68UjE5JOuHFb/RacfOkojiYjhyhF9REThRhz0knaWNK/jsULSiYP67CfpmY4+X2w+5IiIGI4RT93YvhfYHUDS2sBS4IouXW+yPW2k+4mIiGbamrp5F/CA7YdaqhcRES1pK+iPAC7use0dkuZLulrSW3oVkDRd0hxJc5YvX97SsCIionHQS3od8G7g8i6bbwe2t70bcBbw/V51bM+w3W+7v6+vr+mwIiKi1sYR/cHA7baXDd5ge4Xt5+rlq4B1JW3Zwj4jImKI2gj6I+kxbSNpK0mql6fU+3uihX1GRMQQNfqDKUkbAAcCn+ho+ySA7XOAw4HjJa0EXgSOsO0m+4yIiOFpFPS2nwe2GNR2Tsfy2cDZTfYRERHN5C9jIyIKl6CPiChcgj4ionAJ+oiIwiXoIyIKl6CPiChcgj4ionAJ+oiIwiXoIyIKl6CPiChcgj4ionAJ+oiIwiXoIyIKl6CPiChcgj4ionAJ+oiIwiXoIyIKl6CPiChc46CXtEjSAknzJM3psl2S/knS/ZLulLRH031GRMTQNbpnbIf9bT/eY9vBwE71Y0/ga/XXiIgYA2MxdXMY8C1XfgZsKmnrMdhvRETQTtAbuE7SXEnTu2zfFljcsb6kbouIiDHQxtTNPraXSnoDMEvSPbZvHG6R+pfEdIBJkya1MKwYTyafdOWw+i86/dDXVP2IJhof0dteWn99DLgCmDKoy1JgYsf6dnXb4DozbPfb7u/r62s6rIiIqDUKekkbSNpoYBk4CFg4qNtM4CP12TdvB56x/UiT/UZExNA1nbqZAFwhaaDWRbavkfRJANvnAFcBhwD3Ay8AH224z4iIGIZGQW/7QWC3Lu3ndCwb+HST/URExMjlL2MjIgqXoI+IKFyCPiKicAn6iIjCJegjIgqXoI+IKFyCPiKicAn6iIjCJegjIgqXoI+IKFyCPiKicAn6iIjCJegjIgqXoI+IKFyCPiKicAn6iIjCtXFz8IgYZeP95ubjvf54lyP6iIjCjTjoJU2U9GNJP5d0l6TPdumzn6RnJM2rH19sNtyIiBiuJlM3K4HP2b5d0kbAXEmzbP98UL+bbE9rsJ+IiGhgxEf0th+xfXu9/CxwN7BtWwOLiIh2tDJHL2ky8PvALV02v0PSfElXS3pLG/uLiIiha3zWjaQNge8CJ9peMWjz7cD2tp+TdAjwfWCnHnWmA9MBJk2a1HRYERFRa3REL2ldqpD/ju3vDd5ue4Xt5+rlq4B1JW3ZrZbtGbb7bff39fU1GVZERHRoctaNgPOAu22f0aPPVnU/JE2p9/fESPcZERHD12TqZm/gw8ACSfPqtpOBSQC2zwEOB46XtBJ4ETjCthvsMyIihmnEQW/7ZkCr6HM2cPZI9xEREc3lEggREaswnEssrImXV8glECIiCpegj4goXII+IqJwCfqIiMIl6CMiCpegj4goXII+IqJwCfqIiMIl6CMiCpegj4goXC6BEBGxGg3n8gowskss5Ig+IqJwCfqIiMIl6CMiCpegj4goXII+IqJwCfqIiMIl6CMiCtco6CVNlXSvpPslndRl+3qSLq233yJpcpP9RUTE8I046CWtDfwzcDCwK3CkpF0HdTsWeMr2jsCZwJdHur+IiBiZJkf0U4D7bT9o+9fAJcBhg/ocBnyzXv4X4F2S1GCfERExTLI9sidKhwNTbR9Xr38Y2NP2CR19FtZ9ltTrD9R9Hu9SbzowvV7dGbh3GMPZEnhFzZaMZu3UT/3UT/22am9vu6/bhjXmWje2ZwAzRvJcSXNs97c8pFGvnfqpn/qpPxa1m0zdLAUmdqxvV7d17SNpHWAT4IkG+4yIiGFqEvS3ATtJ2kHS64AjgJmD+swEjq6XDweu90jniiIiYkRGPHVje6WkE4BrgbWB823fJek0YI7tmcB5wIWS7geepPplMBpGNOWzBtRO/dRP/dQf9doj/jA2IiLGh/xlbERE4RL0ERGFS9BHRBRujTmPfqgkvRnYFrjF9nMd7VNtX7P6RjY09fgPo/o3QHUK6kzbd7dUfwpg27fVl6SYCtxj+6o26nfZ37dsf2Q0ardN0meAK2wvHqX6ewJ3214h6fXAScAewM+Bv7P9zGjsty2S/ivwPqpTol8CfgFcZHtFC7UHzsx72PYPJR0F7AXcDcyw/Zum+xi0v32o/np/oe3r2qw9Ho2rD2PrH9RPU31z7A581va/1ttut73HKO//o7a/0eD5/xM4kupyEUvq5u2ofgAusX16w/GdQnXtoXWAWcCewI+BA4Frbf9tw/qDT58VsD9wPYDtdzepP9okPQM8DzwAXAxcbnt5i/XvAnarz0ibAbxAfemPuv19be2rbfXP1jTgRuAQ4A7gaeC9wKds39Cw/neovi/Xr+tuCHyP6rWR7aNf5elDqX+r7Sn18sepcuIK4CDg35r+bI17tsfNA1gAbFgvTwbmUIU9wB1jsP9fNXz+L4B1u7S/DrivpddnbaofphXAxnX764E7W6h/O/BtYD9g3/rrI/XyvqP82l/dQo07qKYrD6I69Xc5cA3V33ps1EL9uztfq0Hb5rVQfxPgdOAeqtOVn6A66Dkd2LSN7516eX3ghnp5Uhs/WwPff1Rhv6xjX2rpe/OOjuXbgL56eQNgQUvfgxsDfw9cCBw1aNtXW6i/FfA1qotFbgGcWv+/XAZs3aT2eJujX8v1dI3tRVRBc7CkM6i+YRqTdGePxwJgQsPyLwPbdGnfut7W1ErbL9l+AXjA9Vtu2y+2VL8fmAt8HnjG1VHei7Zn257dtLikPXo8/oDqHVxTtv2y7etsH0v1f/FVqumtB1uov1DSR+vl+ZL6ASS9CWhjauIy4ClgP9ub296C6h3VU/W2pgamctejOuLG9q+AdVuovVY9fbMR1S+STTr21Vb9zSRtQfUOYTmA7eeBlS3UB/gGVc58FzhC0nclrVdve3sL9S+gmuZbTPVO/EWqd1c3Aec0KTze5uiXSdrd9jwA289JmgacD7ytpX1MAP6Y6oenk4B/b1j7ROBHku6j+s+E6ohpR+CEns8aul9LWr8O+j8YaJS0CS0Eve2XgTMlXV5/XUa730O3AbPp/kt70xbq/05dV/PCM4GZktZvof5xwP+T9AWqi1H9VNJiqv/r41qoP9n271zq2/ajwJclfaxh7XOB2yTdAryT+pLikvqo3j00dR7VO5G1qQ4ULpf0IFVAXtJC/U2oDkIEWNLWth+RtCEtHQQCb7T9p/Xy9yV9HrheUltTlhNsnwUg6VMd/9dnSTq2SeHxNke/HdVR66Ndtu1t+yct7OM84Bu2b+6y7SLbRzWsvxbVh0SdH8beZvulJnXr2uvZ/s8u7VtSvfVb0HQfg+oeCuxt++SW6i0E3mv7vi7bFtue2OVpw6n/Jtu/aFJjiPvZGNiB6pfgEtvLWqp7HfBD4JsDNSVNAI4BDrR9QMP6bwF2ofoA856Gw+1WfxsA2w9L2hQ4gGo69Na299Wxz/WpAvSXLdS6G3hLfcAz0HYM8D+oppS3b1h/vu3d6uX/bfsLHdsW2B7xwey4CvooW33p6wW2X3GJaknvsf391TCsNYakzajO5DkMeEPdvIzqXcnptge/C40WSfoKcJ3tHw5qnwqcZXunhvVPA77ijrMJ6/Ydqf5/Dx9x7QR9jAdNz3gqXV6f1Wu0X//GZ/wl6GM8kPQr25NW9zjWVHl9Vq/Rfv2b1h9vH8ZGwSTd2WsTzc94Gvfy+qxeo/36j2b9BH2sSUbzjKcS5PVZvUb79R+1+gn6WJP8gOrshXmDN0i6YeyHs8bJ67N6jfbrP2r1M0cfEVG48faXsRERMUwJ+oiIwiXo4zVP0n6S9lrd44gYLQn6iOrieKMa9Krk5y1Wi3zjRbEkfaS+8uh8SRdK+hNJt0i6Q9IPJU2QNBn4JPDfJc2T9E5JffWVCW+rH3vX9fokzZJ0l6RzJT1UX0cISX8haWH9OLFumyzpXknfAhYCfy3p/3aM7+OSzhzr1yVee3LWTRSpvkDXFcBeth+XtDlg4GnblnQcsIvtz0k6FXjO9v+pn3sR1fXFb5Y0ieqmLbtIOhtYavvv6+ubXA30AdtTXWL27VTnPN8C/BnV+dAP1mP4WX0lxfnAm23/RtK/A59o+2JzEYPlPPoo1R9R3UHqcQDbT0p6G3CppK2pbvbS64qGBwC7Sr+9uu3GdUjvQ3XHJWxfI2ngD1v2obpF4fMAkr5HdanfmcBDtn9WP+c5SdcD0+orIa6bkI+xkKCP15KzgDNsz5S0H9UdfLpZC3i77f/obOwI/uF4ftD6ucDJVNdmz0XIYkxkjj5KdT3wflV3HKKeutmE6vr/UN0+cMCzVHc+GnAd8OcDK5IG7m71E+ADddtBwGZ1+03AeyStL2kDqqP+m7oNyvYtVDffPorqvrURoy5BH0WyfRfwt8BsSfOBM6iO4C+XNJfqDlAD/g1478CHscBngP76g9yfU31YC/Al4KD6BinvBx4FnrV9O9Uc/a1U8/Pn2r7jVYZ3GfCTXD8+xko+jI0YIlX3B33J9kpJ7wC+ZnvY97KV9APgTNs/an2QEV1kjj5i6CYBl9Xnw/8a+PhwnlzfPu9WYH5CPsZSjugjIgqXOfqIiMIl6CMiCpegj4goXII+IqJwCfqIiMIl6CMiCvf/AZZ1yViyHcehAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "train_df['label'].value_counts().plot(kind='bar')\n",
    "plt.title('News class count')\n",
    "plt.xlabel(\"category\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在数据集中标签的对应的关系如下：{'科技': 0, '股票': 1, '体育': 2, '娱乐': 3, '时政': 4, '社会': 5, '教育': 6, '财经': 7, '家居': 8, '游戏': 9, '房产': 10, '时尚': 11, '彩票': 12, '星座': 13}\n",
    "\n",
    "从统计结果可以看出，赛题的数据集类别分布存在较为不均匀的情况。在训练集中科技类新闻最多，其次是股票类新闻，最少的新闻是星座新闻。\n",
    "\n",
    "#### **字符分布统计**\n",
    "\n",
    "接下来可以统计每个字符出现的次数，首先可以将训练集中所有的句子进行拼接进而划分为字符，并统计每个字符的个数。\n",
    "\n",
    "从统计结果中可以看出，在训练集中总共包括6869个字，其中编号3750的字出现的次数最多，编号3133的字出现的次数最少。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2405\n",
      "('3750', 3702)\n",
      "('5034', 1)\n"
     ]
    }
   ],
   "source": [
    "from collections import Counter\n",
    "all_lines = ' '.join(list(train_df['text']))\n",
    "word_count = Counter(all_lines.split(\" \"))\n",
    "word_count = sorted(word_count.items(), key=lambda d:d[1], reverse = True)\n",
    "\n",
    "print(len(word_count))\n",
    "\n",
    "print(word_count[0])\n",
    "\n",
    "print(word_count[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里还可以根据字在每个句子的出现情况，反推出标点符号。下面代码统计了不同字符在句子中出现的次数，其中字符3750，字符900和字符648在20w新闻的覆盖率接近99%，很有可能是标点符号。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('900', 99)\n",
      "('3750', 99)\n",
      "('648', 96)\n"
     ]
    }
   ],
   "source": [
    "from collections import Counter\n",
    "train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))\n",
    "all_lines = ' '.join(list(train_df['text_unique']))\n",
    "word_count = Counter(all_lines.split(\" \"))\n",
    "word_count = sorted(word_count.items(), key=lambda d:int(d[1]), reverse = True)\n",
    "\n",
    "print(word_count[0])\n",
    "\n",
    "print(word_count[1])\n",
    "\n",
    "print(word_count[2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **数据分析的结论**\n",
    "\n",
    "通过上述分析我们可以得出以下结论：\n",
    "\n",
    "1. 赛题中每个新闻包含的字符个数平均为1000个，还有一些新闻字符较长；\n",
    "2. 赛题中新闻类别分布不均匀，科技类新闻样本量接近4w，星座类新闻样本量不到1k；\n",
    "3. 赛题总共包括7000-8000个字符；\n",
    "\n",
    "通过数据分析，我们还可以得出以下结论：\n",
    "\n",
    "1. 每个新闻平均字符个数较多，可能需要截断；\n",
    "\n",
    "2. 由于类别不均衡，会严重影响模型的精度；\n",
    "\n",
    "### **本章小结**\n",
    "\n",
    "本章对赛题数据进行读取，并新闻句子长度、类别和字符进行了可视化分析。\n",
    "\n",
    "### **本章作业**\n",
    "\n",
    "1. 假设字符3750，字符900和字符648是句子的标点符号，请分析赛题每篇新闻平均由多少个句子构成？\n",
    "2. 统计每类新闻中出现次数对多的字符"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "source": [],
    "metadata": {
     "collapsed": false
    }
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}